Introduction

This report uses three datasets.

  1. Airport, train station and ferry terminal dataset
  2. Flight Route Database
  3. Airline Database

Data Source

OurAirports

OurAirports is a free site where visitors can explore the world’s airports, read other people’s comments, and leave their own. The site is aimed at both passengers and pilots, and users can look up airports anywhere in the world. It was started in 2007 to create a good source of global aviation data available to anyone.

OpenFlights

OpenFlights is a tool that lets users map their flights around the world, search and filter them in all sorts of interesting ways, calculate statistics automatically, and share their flights and trips with friends and the entire world. OpenFlights is also the name of the open-source project that builds the tool.

AIM

Preparation

library(data.table) # fast data import
library(tidyverse) # data manipulation
library(plotly) # interactive visualizations
library(janitor) # column name cleaning
library(stringr) # string manipulation
library(treemap) # tree map visualization
library(igraph) # graph construction
library(gridExtra) # arranging plots side by side
library(ggraph) # graph visualization

getwd()
## [1] "C:/Users/USER/Desktop/Airline_Analysis"
setwd("C:\\Users\\USER\\Desktop\\Airline_Analysis\\input")
airport <- read_csv("airports-extended.csv", col_names = F)
names(airport) <- c("Airpot_ID", "Airport_Name", "City", "Country", "IATA", 
                 "ICAO", "Latitude", "Longitude", "Altitude", "Timezone", 
                 "DST", "Tz", "Type", "Source")
airport <- airport %>% 
    filter(Type == "airport")
airline <- read_csv("airlines.csv") %>% 
    clean_names()
route <- read_csv("routes.csv") %>% 
    clean_names()
names(route)[5] <- "destination_airport"

countries <- read_csv("countries of the world.csv")
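The airport file has no header row, so the column names have to be supplied by hand and the rows are then restricted to `Type == "airport"`. As a minimal base R sketch of that step, the snippet below reads two made-up rows in the same 14-column layout. The rows, the `toy_*` names and the assumption that missing codes are written as `\N` are ours for illustration, not part of the analysis above.

```r
# Column names in the order used for the airport data above.
airport_cols <- c("Airpot_ID", "Airport_Name", "City", "Country", "IATA",
                  "ICAO", "Latitude", "Longitude", "Altitude", "Timezone",
                  "DST", "Tz", "Type", "Source")

# Two made-up rows (not real data): one airport and one train station.
toy_csv <- paste(
  '1,"Example Airport","Exampleville","Exampleland","EXA","EXAA",10.5,20.5,100,0,"N","Etc/UTC","airport","User"',
  '2,"Example Station","Exampleville","Exampleland",\\N,\\N,10.6,20.6,90,0,"N","Etc/UTC","station","User"',
  sep = "\n"
)

# Read a headerless CSV, name the columns, and treat \N as missing (assumption).
toy_all <- read.csv(text = toy_csv, header = FALSE, col.names = airport_cols,
                    na.strings = "\\N", stringsAsFactors = FALSE)

# Keep only rows that describe airports, as in the preparation step.
toy_airport <- toy_all[toy_all$Type == "airport", ]
print(toy_airport[, c("Airport_Name", "IATA", "Type")])
```

Naming the columns at read time keeps the mapping between position and meaning in one place, which makes a mislabelled column easier to spot.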

Data Content: First 5 rows

Data No.1

airport %>% 
    head(5) %>% 
    DT::datatable(options = list(
        lengthMenu = c(5,3,1)
    ))
  • Airpot_ID: Unique OpenFlights identifier for this airport.
  • Airport_Name: Name of airport. May or may not contain the City name.
  • City: Main city served by airport. May be spelled differently from Name.
  • Country: Country or territory where airport is located. See countries.dat to cross-reference to ISO 3166-1 codes.
  • IATA: 3-letter IATA code. Null if not assigned/unknown.
  • ICAO: 4-letter ICAO code. Null if not assigned.
  • Latitude: Decimal degrees, usually to six significant digits. Negative is South, positive is North.
  • Longitude: Decimal degrees, usually to six significant digits. Negative is West, positive is East.
  • Altitude: Altitude in feet.
  • Timezone: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.
  • DST: Daylight saving time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).
  • Tz: Timezone in “tz” (Olson) format, eg. “America/Los_Angeles”.
  • Type: Type of the airport. Value “airport” for air terminals, “station” for train stations, “port” for ferry terminals and “unknown” if not known.
  • Source: Source of this data. “OurAirports” for data sourced from OurAirports, “Legacy” for old data not matched to OurAirports (mostly DAFIF), “User” for unverified user contributions. In airports.csv, only source=OurAirports is included.

This dataset covers 7750 objects.

Data No.2

airline %>% 
    head(5) %>% 
    DT::datatable(options = list(
        lengthMenu = c(5,3,1)
    ))
  • airline_id: Unique OpenFlights identifier for this airline.
  • name: Name of the airline.
  • alias: Alias of the airline. For example, All Nippon Airways is commonly known as “ANA”.
  • iata: 2-letter IATA code, if available.
  • icao: 3-letter ICAO code, if available.
  • callsign: Airline callsign.
  • country: Country or territory where airline is incorporated.
  • active: “Y” if the airline is or has until recently been operational, “N” if it is defunct.
    • This field is not reliable: in particular, major airlines that stopped flying long ago, but have not had their IATA code reassigned (eg. Ansett/AN), will incorrectly show as “Y”.

This dataset covers 6162 objects.

Data No.3

route %>% 
    head(5) %>% 
    DT::datatable(options = list(
        lengthMenu = c(5,3,1)
    ))
  • airline: 2-letter (IATA) or 3-letter (ICAO) code of the airline.
  • airline_id: Unique OpenFlights identifier for airline (see Airline).
  • source_airport: 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
  • source_airport_id: Unique OpenFlights identifier for source airport (see Airport)
  • destination_airport: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
  • destination_airport_id: Unique OpenFlights identifier for destination airport (see Airport)
  • codeshare: “Y” if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.
  • stops: Number of stops on this flight (“0” for direct)
  • equipment: 3-letter codes for plane type(s) generally used on this flight, separated by spaces

This dataset covers 135,326 objects.

Analysis

Global Airports Distribution

geo <- list(
  scope = "world",
  projection = list(type = "orthographic"),
  showland = TRUE,
  resolution = 110, # plotly accepts 110 or 50
  landcolor = toRGB("gray90"),
  countrycolor = toRGB("gray80"),
  oceancolor = toRGB("lightsteelblue2"),
  showocean = TRUE
)

# Markers are placed by longitude/latitude, so no locationmode is needed
plot_geo() %>%
  add_markers(data = airport, # already filtered to Type == "airport"
              x = ~Longitude,
              y = ~Latitude,
              text = ~paste('Airport: ', Airport_Name),
              alpha = .5, color = I("red")) %>% # I() keeps the literal colour
  layout(
    title = "Global Airports",
    geo = geo,
    showlegend = FALSE
  )
print(paste("There are", airport %>% 
              filter(Type == "airport") %>% 
              nrow(), 
            "airports around the world."))
## [1] "There are 7750 airports around the world."

There are 7750 airports around the world, according to the dataset.

Global Airline Routes

route <- route %>% mutate(id = rownames(route))
route <- route %>% gather('source_airport', 'destination_airport', key = "Airport_type", value = "Airport")
gloabal.flight.route <- merge(route, airport %>% select(Airport_Name, IATA, Latitude, Longitude, Country, City),
      by.x = "Airport", by.y = "IATA")
world.map <- map_data ("world")
world.map <- world.map %>% 
  filter(region != "Antarctica")

ggplot() + 
  geom_map(data=world.map, map=world.map,
           aes(x=long, y=lat, group=group, map_id=region),
           fill="white", colour="black") +
  geom_point(data = gloabal.flight.route, 
             aes(x = Longitude, y = Latitude), 
             size = .1, alpha = .5, colour = "red") +
  geom_line(data = gloabal.flight.route, 
            aes(x = Longitude, y = Latitude, group = id), 
            alpha = 0.05, colour = "red") +
  labs(title = "Global Airline Routes")
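The route map depends on reshaping the route table from one row per route to two rows per route (one for each endpoint) that share the same `id`, so that `geom_line(group = id)` can connect them. The sketch below illustrates that reshape on a toy table using `tidyr::pivot_longer()`, which is the current equivalent of `gather()`. The airport codes and the `toy_*` names are placeholders we made up, not values from the dataset.

```r
library(tidyr)

# Hypothetical toy routes; the airport codes are placeholders, not real data.
toy_route <- data.frame(
  id = c("1", "2"),
  source_airport = c("AAA", "BBB"),
  destination_airport = c("BBB", "CCC"),
  stringsAsFactors = FALSE
)

# Wide to long: each route becomes two rows, one per endpoint.
toy_long <- pivot_longer(
  toy_route,
  cols = c(source_airport, destination_airport),
  names_to = "Airport_type",
  values_to = "Airport"
)

print(toy_long)
print(table(toy_long$id)) # every id should appear exactly twice
```

After this reshape, a single merge against the airport table attaches coordinates to both endpoints at once, which is why the long format is convenient here. Note that `merge()` keeps only matching rows by default, so a route endpoint whose code is missing from the airport table is dropped.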

ggplot() + 
  geom_map(data=world.map, map=world.map,
           aes(x=long, y=lat, group=group, map_id=region),
           fill="white", colour="grey") + 
  geom_point(data = airport %>% 
               filter(Altitude >= 5000),
             aes(x = Longitude, y = Latitude, colour = Altitude), 
             size = .7) +
  labs(title = "Airports located over 5,000 feet altitude") +
  ylim(-60, 90) +
  theme(legend.position = c(.1, .25))

print(paste(airport %>% 
              filter(Altitude >= 5000) %>% 
              nrow(), 
            "airports are located over 5,000 feet altitude."))
## [1] "298 airports are located over 5,000 feet altitude."

There are 298 airports located at 5,000 feet or higher around the world. They are mainly distributed in mountainous areas such as the Rockies, the Andes and the Himalayas. Papua New Guinea also has a few airports above 5,000 feet.

Which Country has the most Airports?

connection.route <- route %>% 
  spread(key = Airport_type, value = Airport) %>% 
  select(destination_airport, source_airport, id)
airport.country <- airport %>% 
  select(City, Country, IATA)
flight.connection <- merge(connection.route, airport.country, by.x = "source_airport", by.y = "IATA")
names(flight.connection)[4:5] <- c("source.City", "source.Country")

flight.connection <- merge(flight.connection, airport.country, by.x = "destination_airport", by.y = "IATA")
names(flight.connection)[6:7] <- c("destination.City", "destination.Country")

flight.connection <- flight.connection %>% 
  select(id, contains("source"), contains("destination"))
data.frame(table(airport$Country)) %>% 
  arrange(desc(Freq)) %>% 
  head(20) %>% 
  ggplot(aes(x = reorder(Var1, -Freq), y = Freq, fill = Var1, label = Freq)) + 
  geom_bar(stat = "identity", show.legend = F) +
  labs(title = "Top 20 Countries with the Most Airports", 
       x = "Country", y = "Number of Airports") +
  geom_label(angle = 45, show.legend = F) +
  theme(axis.text.x = element_text(angle = 40, size = 15))

The United States has by far the most airports. This is probably because the United States has many military bases around the world. Countries with large territories, such as Russia and Canada, also have many airports because they need them to reach remote cities. However, smaller countries such as Japan also rank in the top 20. This suggests that the number of airports depends on both how large a country is and how strong its economy is.
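The ranking behind this chart is a frequency count of the `Country` column. As a small sketch of the same idea with `dplyr::count()`, the example below counts and sorts a toy column. The country labels "A", "B" and "C" and the `toy_*` names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per airport, labelled by a placeholder country.
toy_airport <- data.frame(
  Country = c("A", "A", "B", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Count airports per country, sort from most to fewest, and keep the top 2.
toy_top <- toy_airport %>%
  count(Country, name = "Airports", sort = TRUE) %>%
  head(2)

print(toy_top)
```

Compared with `data.frame(table(...))`, this keeps meaningful column names (`Country`, `Airports`) instead of `Var1` and `Freq`, at the cost of a different object type (a data frame rather than a table).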

Treemap Visualization

treemap(data.frame(table(airport$Country)),
        index="Var1",
        vSize="Freq",
        type="index",
        title = "Number of Airports by Country")

Which Country has the most Airlines?

data.frame(table(airline$country)) %>% 
  arrange(desc(Freq)) %>% head(20) %>% 
  ggplot(aes(x = reorder(Var1, -Freq), y = Freq, 
             fill = Var1, label = Freq)) + 
  geom_bar(stat = "identity", show.legend = F) +
  geom_label(show.legend = F) +
  theme(axis.text.x = element_text(angle = 40, size = 15)) +
  labs(x = "Country", y = "Number of Airlines", 
       title = "Top 20 Countries with the Most Airlines")

Airports vs Airlines

country.airport <- data.frame(table(airport$Country))
names(country.airport)[2] <- "Airport"

country.airline <- data.frame(table(airline$country))
names(country.airline)[2] <- "Airline"

lineports <- merge(country.airport, country.airline, by = "Var1")
lineports %>% 
  ggplot(aes(x = Airport, y = Airline)) + 
  geom_point(show.legend = F) +
  geom_smooth() + 
  labs(title = "Airports vs Airlines") +
  scale_x_continuous(trans = 'log10',
                     breaks = c(10, 100, 500, 1000))

There is a positive correlation between the number of airports and the number of airlines in a country.
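The scatter plot shows the relationship visually, and a correlation coefficient is one way to put a number on it. The base R sketch below computes a Spearman rank correlation and a Pearson correlation on log10 counts for a toy table shaped like `lineports`. The numbers are invented for illustration, so the printed coefficients say nothing about the real data.

```r
# Hypothetical toy table with the same columns as lineports (made-up counts).
toy_lineports <- data.frame(
  Var1 = c("A", "B", "C", "D", "E"),
  Airport = c(5, 20, 80, 300, 1500),
  Airline = c(2, 9, 15, 60, 400),
  stringsAsFactors = FALSE
)

# Rank-based correlation: robust to the heavy right skew of count data.
toy_rho <- cor(toy_lineports$Airport, toy_lineports$Airline, method = "spearman")

# Pearson correlation on log10 counts, matching the log-scaled x axis idea.
toy_r_log <- cor(log10(toy_lineports$Airport), log10(toy_lineports$Airline))

print(c(spearman = toy_rho, pearson_log10 = toy_r_log))
```

On the real table the same two calls could be applied to `lineports$Airport` and `lineports$Airline`. A rank-based measure seems a reasonable default here because a few countries have far more airports than the rest.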

Do isolated countries have international flights? If yes, where?

In this section, we are going to focus on airports and airlines in North Korea.

NK.airport <- airport %>% filter(Country == "North Korea")
NK.flight.connection <- flight.connection %>% 
  filter(source.Country == "North Korea" | destination.Country == "North Korea")
NK.gloabal.flight.route.id <- 
  gloabal.flight.route %>% 
  filter(Country == "North Korea") %>% select(id)

NK.gloabal.flight.route.id <- NK.gloabal.flight.route.id$id %>% as.vector()

NK.gloabal.flight.route.id <- 
  gloabal.flight.route %>% filter(id %in% NK.gloabal.flight.route.id)
NorthKorea.ggmap <- ggplot() + 
  geom_map(data=world.map, map=world.map,
           aes(x=long, y=lat, group=group, map_id=region),
           fill="white", colour="black") +
  geom_point(data = NK.gloabal.flight.route.id, 
             aes(x = Longitude, y = Latitude), colour = "red") +
  geom_point(data = NK.airport, 
             aes(x = Longitude, y = Latitude), colour = "red") +
  geom_line(data = NK.gloabal.flight.route.id, 
             aes(x = Longitude, y = Latitude, group = id), colour = "red") + 
  xlim(100, 140) + ylim(0, 45) +
  labs(title = "Airports & International Airlines from/to/in North Korea") +
  coord_fixed(ratio = 1.1)

Flight.Country.Connection <- NK.flight.connection %>% 
  select(contains("Country"), id)
names(Flight.Country.Connection) <- c("From", "To", "id")
Flight.Country.Connection <- Flight.Country.Connection %>% 
  mutate(Combination = paste0(From, "-", To))
Flight.Country.Connection <- Flight.Country.Connection %>% 
  group_by(Combination) %>% 
  mutate(Weight = NROW(Combination)) %>% 
  arrange(-Weight) %>% 
  ungroup()
Flight.Country.Connection <- Flight.Country.Connection[!duplicated(Flight.Country.Connection$Combination),] %>% 
  select(-id, -Combination)
Flight.Country.graph <- graph_from_data_frame(Flight.Country.Connection, directed = FALSE)


Flight.Country.graph$name <- "Flight Country Network"
V(Flight.Country.graph)$id <- 1:vcount(Flight.Country.graph)


NW.Plot <- ggraph(Flight.Country.graph, layout = "kk") +
  geom_edge_link(aes(alpha = Weight), 
                 colour = "red") +
  geom_node_point(size = 5, colour = "red") +
  geom_node_text(aes(label = name), repel = TRUE, size = 7) +
  labs(title = "Flight Country Network", x = "", y = "") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.line = element_blank(),
        axis.text.x=element_blank(), axis.text.y=element_blank())

grid.arrange(NorthKorea.ggmap, NW.Plot, ncol=2)
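The edge weights in the network are the number of routes for each From/To pair. The code above derives them by pasting the pair into a key, counting within groups and then dropping duplicates. A shorter way to express the same counting, sketched here on a toy table, is `dplyr::count()`. The country labels "X", "Y" and "Z" and the `toy_*` names are hypothetical.

```r
library(dplyr)

# Hypothetical toy connections: one row per route between placeholder countries.
toy_connection <- data.frame(
  From = c("X", "X", "Y", "X"),
  To   = c("Y", "Y", "X", "Z"),
  stringsAsFactors = FALSE
)

# One row per distinct From/To pair, with the number of routes as Weight.
toy_edges <- toy_connection %>%
  count(From, To, name = "Weight") %>%
  arrange(desc(Weight))

print(toy_edges)
```

As in the original approach, the pairs are direction-sensitive: X to Y and Y to X are counted as separate edges, even if the graph is later built with `directed = FALSE`.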

Airports with Flight Access to North Korea

NK.gloabal.flight.route.id %>% 
  filter(Country != "North Korea") %>% 
  select(Airport_Name, Country, City, Latitude, Longitude) %>% 
  distinct(City, .keep_all = T) %>% 
  DT::datatable(options = list(
    lengthMenu = c(4,1)
  ))

It turns out that three countries have flight access to North Korea: China (Beijing and Shenyang), Russia (Vladivostok) and Malaysia (Kuala Lumpur). These are the four cities that have airports with flights to North Korea.

Asian Countries’ Flights Network

country.connection <- flight.connection %>% 
  select(contains("Country")) %>% 
  mutate(Link = paste0(source.Country, "-", destination.Country))
country.connection <- country.connection[!duplicated(country.connection$Link),]
names(country.connection) <- c("from", "to", "Link" )

#### Selecting Asian Countries
Country.list <- countries %>% 
  select(Country, Region)
Country.list$Country <- as.character(Country.list$Country)
Asian.Country.list <- Country.list %>% 
  arrange(Region) %>% 
  head(28)
Asian.Country <- Asian.Country.list$Country
# The piped value is already the text to clean, so it is not passed again.
# Note: removing every space also collapses multi-word names.
Asian.Country <- Asian.Country %>% 
  gsub(pattern = ", ", replacement = "") %>% 
  gsub(pattern = " ", replacement = "") %>% 
  gsub(pattern = "KoreaNorth", replacement = "North Korea") %>% 
  gsub(pattern = "KoreaSouth", replacement = "South Korea")

country.connection <- country.connection %>% 
  filter(from %in% Asian.Country & to %in% Asian.Country) %>% 
  select(-Link)

# Drop self-loops (connections within a single country)
country.connection <- country.connection %>% 
  filter(from != to)
g <- graph_from_data_frame(country.connection, directed = TRUE)
V(g)$color <- ifelse(
  V(g)$name == "North Korea", "red", "yellow"
)

plot(g, layout =  layout_with_dh(g),
     edge.arrow.size=0.8,
     vertex.size = 17, vertex.label.cex = 2)

This graph visualizes the flight network among Asian countries. The red node represents North Korea, which is connected to two other countries, China and Malaysia. As the plot shows, both of those countries have many connections.
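The observation about how many connections a country has can be quantified with node degree, the number of edges touching a node. The sketch below builds a small directed toy graph with igraph and ranks its nodes by degree. The node labels "A" to "D" and the `toy_*` names are hypothetical, and the same `degree()` call could be applied to `g` from the code above.

```r
library(igraph)

# Hypothetical toy edge list between placeholder countries.
toy_edges <- data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("B", "C", "C", "D"),
  stringsAsFactors = FALSE
)

toy_g <- graph_from_data_frame(toy_edges, directed = TRUE)

# mode = "all" counts incoming and outgoing edges together.
toy_degree <- sort(degree(toy_g, mode = "all"), decreasing = TRUE)

print(toy_degree)
```

In a directed graph each country pair can contribute up to two edges, one per direction, so a degree computed with `mode = "all"` can be up to twice the number of distinct neighbouring countries.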

Conclusion